6 Interactions

So far we’ve only created features from single variables, but what about the effects of two variables acting together? Would it help our models if we added an interaction effect between, for example, HomePlanet and Destination to create the new feature HomeDestination? What about other such interactions?

In the previous competition, for the Titanic of 1912, the sex of a passenger mattered (women were more likely to survive) and the ticket class mattered (first class was more likely to survive), but the interaction “woman in first class” had an almost 100% survival rate, and this interaction improved the model (if I remember correctly). We want to discover whether such interactions exist between the variables that we have for our space odyssey.

Of course, we won’t know the extent of the improvement until we test the interaction effects in various models. Some models inherently discover interactions (like tree-based models), so adding explicit interaction effects might not matter for them, while it might matter for others.

6.1 Visual exploration of interactions

Since we only have a few categorical variables, let’s visualize all possible interactions. Here I use the default glm logistic regression model with the formula Transported ~ (.)^2, which amounts to Outcome ~ Variable_1 + Variable_2 + Variable_1:Variable_2.
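
As a quick sanity check of that expansion (toy data frame with hypothetical columns A and B):

```r
# terms() shows which model terms the formula generates
toy <- data.frame(Transported = c(0, 1, 0, 1), A = 1:4, B = 4:1)
attr(terms(Transported ~ (.)^2, data = toy), "term.labels")
#> [1] "A"   "B"   "A:B"
```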

plot_simple_int <- function(df, v1, v2) {
  tmp_vars <- c("Transported", v1, v2)
  tmp_model <- glm(Transported ~ (.)^2, data = df[, tmp_vars], family = binomial())
  p <- interactions::cat_plot(tmp_model, pred = {{ v1 }}, modx = {{ v2 }}, geom = "line", colors = c25,
                       main.title = paste(v1, "and", v2)) +
    theme(text = element_text(size = 40), plot.title = element_text(size = 40))
  return(p)
}

preds_cat <- c("CryoSleep", "HomePlanet", "Destination", "VIP", "Deck", "Side")
pairs_cat <- combn(preds_cat, 2, simplify = FALSE)
c25 <- c("dodgerblue2", "#E31A1C", "green4", "#6A3D9A", "#FF7F00", "black", "gold1", "skyblue2", "#FB9A99", "palegreen2", "#CAB2D6",
  "#FDBF6F", "gray70", "khaki2", "maroon", "orchid1", "deeppink1", "blue1", "steelblue4", "darkturquoise", "green1", "yellow4",
  "yellow3", "darkorange4", "brown") # Had to add extra colours because I couldn't get `cat_plot` to work with defaults

save_plot <- function(p, i) {
  ggsave(filename = paste0("PairInt", i, ".png"), plot = p)
}

# cat_pair_int_plots <- pairs_cat %>%
#   map(.x = ., .f = \(vp) plot_simple_int(train6, vp[1], vp[2]))
# 
# cat_plots <- walk2(.x = cat_pair_int_plots, .y = seq_along(cat_pair_int_plots), .f = save_plot)
# cat_slick <- map(1:length(cat_pair_int_plots), .f = \(i) paste0("PairInt", i, ".png"))
# save(cat_slick, file = "Plots cat.RData")
load("Plots cat.RData")

slickR::slickR(cat_slick, height = "480px", width = "672px") +
  slickR::settings(slidesToShow = 1, dots = TRUE)

Figure 6.1: Interaction effects for pairs of categorical variables and against the response.

Parallel lines indicate no significant interaction effect, while lines that cross indicate a potentially significant interaction. I’ll highlight a few interactions below.
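
Why parallel lines? On the log-odds scale, a model without an interaction is additive,

\[
\operatorname{logit} P(\mathrm{Transported}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2,
\]

so changing \(x_2\) only shifts the line for \(x_1\) up or down without changing its slope, while an interaction term \(\beta_3 x_1 x_2\) lets the effect of \(x_1\) depend on the level of \(x_2\). The plots show predicted probabilities rather than log-odds, so “parallel” is only approximate there, but crossing lines remain the tell-tale sign.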

plot_simple_int(train6, "CryoSleep", "HomePlanet")
Interaction between CryoSleep and HomePlanet

Figure 6.2: Interaction between CryoSleep and HomePlanet

The interaction between CryoSleep and HomePlanet indicates that passengers from Earth are less likely to be transported when in cryosleep, which suggests that the combined feature CryoSleep & HomePlanet could be useful.

plot_simple_int(train6, "Deck", "Side")
Interaction between Deck and Side

Figure 6.3: Interaction between Deck and Side

The interaction between Deck and Side, however, seems to show only a minor effect, if any.

Figures 6.4 and 6.5 below show all the interaction effects between both numerical and categorical variables.

plot_simple_int2 <- function(df, v1, v2) {
  tmp_vars <- c("Transported", v1, v2)
  tmp_model <- glm(Transported ~ (.)^2, data = df[, tmp_vars], family = binomial())
  p <- interactions::interact_plot(tmp_model, pred = {{ v1 }}, modx = {{ v2 }}, geom = "line", colors = c25,
                       main.title = paste(v1, "and", v2)) +
    scale_x_log10() +
    theme(text = element_text(size = 40), plot.title = element_text(size = 40))
  return(p)
}

preds_num <- c("Age", "RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck", "CabinNumber", "LastNameAsNumber",
               "PassengerGroup", "CryoSleep", "HomePlanet", "Destination", "VIP", "Deck", "Side")
pairs_num <- combn(preds_num, 2, simplify = FALSE)
pairs_num <- pairs_num[1:90]

save_plot2 <- function(p, i) {
  ggsave(filename = paste0("PairIntNum", i, ".png"), plot = p)
}

# num_pair_int_plots <- pairs_num %>%
#   map(.x = ., .f = \(vp) plot_simple_int2(train6, vp[1], vp[2]))
# 
# int_plots_slick <- walk2(.x = num_pair_int_plots, .y = seq_along(num_pair_int_plots), .f = save_plot2)
# int_plots_slick <- map(1:length(num_pair_int_plots), .f = \(i) paste0("PairIntNum", i, ".png"))
# save(int_plots_slick, file = "Plots num.RData")
load("Plots num.RData")

slickR::slickR(int_plots_slick[1:45], height = "480px", width = "672px") +
  slickR::settings(slidesToShow = 1, dots = TRUE)

Figure 6.4: Interaction effects for pairs of variables against the response. Part 1.

slickR::slickR(int_plots_slick[46:90], height = "480px", width = "672px") +
  slickR::settings(slidesToShow = 1, dots = TRUE)

Figure 6.5: Interaction effects for pairs of variables against the response. Part 2.

Based on what we see in the figures above, there seem to be large interaction effects between the numerical variables, especially between the amenities. We’ll keep this in mind as we explore further.

6.2 Interaction significance with only variable pairs and their interactions

One problem with the above visual approach is that we don’t know if these interaction effects are real - in the sense that they represent some true relationship with the outcome - or if they’re so-called false positives - that is, random effects that happened to be present in the training data. For example, does the fact that passengers from Earth seem less likely to be transported even when in cryosleep reflect some true aspect of the spacetime anomaly that caused the transportation to another dimension, or is this pattern just something that happened this time by pure chance? Another way to think of it: if the Spaceship Titanic were to pass through this spacetime anomaly a thousand times, would the pattern persist?

To explore this further, we must validate some of these effects with cross-validation. We will continue to use our simple glm model to model the response against each pair of variables and then compare it to a similar model that includes the interaction term, using accuracy as the comparison metric.
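
Under the hood, caret’s diff() on a resamples object amounts to a paired t-test on the per-resample metric values. A minimal base-R sketch with made-up accuracies (the values below are illustrative, not from our models):

```r
# Made-up per-resample accuracies for a model with and without the interaction
acc_int    <- c(0.80, 0.81, 0.79, 0.82, 0.80)
acc_no_int <- c(0.78, 0.80, 0.78, 0.80, 0.79)

# One-sided paired t-test: does the interaction model do better?
t.test(acc_int, acc_no_int, paired = TRUE, alternative = "greater")$p.value
```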

compare_models_1way <- function(a, b, metric = a$metric[1], ...) { # A customized compare_models function from caret that allows for
  mods <- list(a, b)                                               # a custom t.test adjustment in the diff-function
  rs <- resamples(mods)
  diffs <- diff(rs, metric = metric[1], adjustment = "none", ...)
  res <- diffs$statistics[[1]][[1]]
  return(res)
}

pair_model <- function(df, v1, v2) { # Model without interactions with only two variables
  tmp_vars <- c("Transported", v1, v2)
  set.seed(8584)
  m <- train(Transported ~ ., data = df[, tmp_vars], preProc = NULL, method = "glm", metric = "accuracy", trControl = ctrl)
  return(m)
}

pair_int_model <- function(df, v1, v2) { # Model with interactions with only two variables
  tmp_vars <- c("Transported", v1, v2)
  set.seed(8584)
  m <- train(Transported ~ (.)^2, data = df[, tmp_vars], preProc = NULL, method = "glm", metric = "accuracy", trControl = ctrl)
  return(m)
}

preds_cat <- c("CryoSleep", "HomePlanet", "Destination", "Age", "VIP", "Deck", "Side", "RoomService", "FoodCourt", "ShoppingMall",
               "Spa", "VRDeck", "CabinNumber", "LastNameAsNumber", "PassengerGroup")
pairs <- combn(preds_cat, 2, simplify = FALSE)
pairs_cols <- combn(preds_cat, 2, simplify = TRUE) %>%
  t() %>%
  as.data.frame()

ctrl <- trainControl(method = "repeatedcv", repeats = 5, classProbs = TRUE, summaryFunction = prSummary, allowParallel = TRUE)

# my_cluster <- makeCluster(detectCores() - 1, type = 'SOCK')
# registerDoSNOW(my_cluster)
# 
# no_int_mods <- pairs %>%
#   map(.x = ., .f = \(vp) pair_model(train6, vp[1], vp[2]))
# 
# int_mods <- pairs %>%
#   map(.x = ., .f = \(vp) pair_int_model(train6, vp[1], vp[2]))
# 
# stopCluster(my_cluster)
# 
# no_int_acc <- no_int_mods %>%
#   list_flatten(.) %>%
#   map(.x = ., .f = \(m) getTrainPerf(m)[1, "TrainPrecision"]) %>%
#   list_c(.)
# 
# diff_res <- map2(.x = no_int_mods, .y = int_mods, .f = \(m1, m2) compare_models_1way(m1, m2, alternative = "greater"))
# 
# save(no_int_acc, file = "No int acc.RData")
# save(diff_res, file = "Diff results pairs.RData")

load("No int acc.RData")
load("Diff results pairs.RData")

diff_res2 <-
  data.frame(Improvement = map_dbl(.x = diff_res, .f = \(est) est$estimate),
             Pvalue = map_dbl(.x = diff_res, .f = \(p) p$p.value)) %>%
  bind_cols(., Accuracy = no_int_acc) %>%
  bind_cols(pairs_cols, .)
diff_res2 %>% 
  filter(Pvalue <= 0.05) %>%
  arrange(desc(Improvement)) %>%
  slice(1:5)
#>           V1          V2 Improvement       Pvalue  Accuracy
#> 1        Spa      VRDeck 0.294366583 2.129008e-12 0.8105429
#> 2  CryoSleep Destination 0.269109101 2.497225e-13 0.6718461
#> 3       Deck   FoodCourt 0.038880545 9.711984e-15 0.5770264
#> 4 HomePlanet   FoodCourt 0.033884868 2.048566e-05 0.5754101
#> 5 HomePlanet CabinNumber 0.006942642 1.135601e-03 0.5765891

diff_res2 %>% 
  filter(Pvalue <= 0.05) %>%
  arrange(desc(Accuracy)) %>%
  slice(1:5)
#>            V1             V2  Improvement       Pvalue
#> 1 RoomService         VRDeck 0.0004863254 4.228540e-03
#> 2         Spa         VRDeck 0.2943665832 2.129008e-12
#> 3         Spa PassengerGroup 0.0023373657 2.409566e-10
#> 4        Side    RoomService 0.0002992205 5.560226e-04
#> 5 RoomService PassengerGroup 0.0019985596 3.514706e-11
#>    Accuracy
#> 1 0.8156325
#> 2 0.8105429
#> 3 0.8062987
#> 4 0.8058843
#> 5 0.8030301

The p-value test here tells us whether the effects of the interaction terms were large enough to be considered non-random. We’ve used a cutoff of 5% but without any adjustment. This can be problematic, since the more tests we run with a 5% cutoff, the higher the chance of finding interactions that appear to have an effect purely by chance (in fact, with our 105 pairs and a 5% cutoff, at least one false positive is practically guaranteed). We will look at some adjustment methods for p-values that take this into account later.
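
The “practically guaranteed” part is easy to verify: if none of the 105 interactions were real and the tests were independent, the probability of at least one false positive at a 5% cutoff would be

```r
# P(at least one false positive) = 1 - P(no false positive in any of 105 tests)
1 - (1 - 0.05)^105 # roughly 0.995
```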

For now, we can conclude that 36 variable pairs seem to improve model performance compared to a model without interaction effects. The two pairs that add the largest improvement are Spa & VRDeck as well as CryoSleep & Destination. Notice also that the accuracy is impressively high for the model that used only RoomService and VRDeck (essentially without interactions), as well as for Spa and VRDeck. This suggests that the amenity variables will be very important.

6.3 Interaction significance with entire model and pairwise interactions

One weakness of the above approach is that we’ve looked at variable pairs and their interactions in the absence of the other variables, so we don’t know how these pairwise interactions contribute to a model with all the other variables. Do the improvements persist, or do some of the interactions correlate with existing variables so that their effects lessen, or do they even introduce noise to the data?

Let’s explore that.

my_split <- initial_split(train6, prop = 0.8)
int_train <- training(my_split) %>%
  mutate(across(.cols = c(PassengerGroup), .fns = as.integer)) %>%
  select(CryoSleep, HomePlanet, Destination, VIP, Deck, Side, Age, RoomService, FoodCourt, ShoppingMall, Spa, VRDeck, CabinNumber,
         LastNameAsNumber, PassengerGroup, Transported)

norm_ctrl <- trainControl(method = "repeatedcv", repeats = 5, classProbs = TRUE, summaryFunction = prSummary)

norm_rec <- recipe(Transported ~ ., data = int_train) %>%
  step_dummy(CryoSleep, HomePlanet, Destination, VIP, Deck, Side)

# my_cluster <- makeCluster(detectCores() - 1, type = 'SOCK')
# registerDoSNOW(my_cluster)
# 
# set.seed(8584)
# norm_m <- train(norm_rec, data = int_train, method = "glm", metric = "accuracy", trControl = norm_ctrl)
# 
# norm_m_acc <- getTrainPerf(norm_m)[1, "TrainPrecision"]
# 
# int_ctrl <- trainControl(method = "repeatedcv", repeats = 5, classProbs = TRUE, summaryFunction = prSummary)
# 
# int_function <- function(rec, f) {
#   ir <- step_interact(recipe = rec, terms = !!f)
#   return(ir)
# }
# # Map over pairs of vars to create int formulas
# int_form <- map(.x = pairs, .f = \(vp) formula(paste0("~starts_with('", vp[1], "'):starts_with('", vp[2], "')")))
# int_rec <- map(.x = int_form, .f = \(form) int_function(norm_rec, f = form))
# 
# train(int_rec[[65]], data = int_train, method = "glm", metric = "accuracy", trControl = int_ctrl)
# 
# set.seed(8584)
# int_m <- map(.x = int_rec, .f = \(r) train(r, data = int_train, method = "glm", metric = "accuracy", trControl = int_ctrl))
# 
# int_m_acc <- map_dbl(.x = int_m, .f = \(m) getTrainPerf(m)[1, "TrainPrecision"])
# 
# anova_res <- map2(.x = list(norm_m), .y = int_m, .f = \(m1, m2) anova(m1$finalModel, m2$finalModel, test = "Chisq")[2, 'Pr(>Chi)'])
# 
# diff_all_res <- map2(.x = int_m, .y = list(norm_m), .f = \(m1, m2) compare_models_1way(m1, m2, alternative = "greater"))
# 
# stopCluster(my_cluster)
# unregister()
# 
# diff_all_res2 <-
#   data.frame(Improvement = map_dbl(.x = diff_all_res, .f = \(est) est$estimate),
#              Resampled_Pvalue = map_dbl(.x = diff_all_res, .f = \(p) p$p.value),
#              Traditional_Pvalue = map_dbl(.x = anova_res, .f = \(p) p)) %>%
#   bind_cols(., No_Int_Accuracy = norm_m_acc) %>%
#   bind_cols(., With_Int_Accuracy = int_m_acc) %>%
#   bind_cols(pairs_cols, .)
# 
# save(diff_all_res2, file = "Results pairwise interactions with all other variables.RData")
load("Results pairwise interactions with all other variables.RData")

diff_all_res2_adj <- diff_all_res2 %>%
  mutate(Resampled_pvalue_FDR = p.adjust(Resampled_Pvalue, method = "fdr"),
         Traditional_pvalue_FDR = p.adjust(Traditional_Pvalue, method = "fdr"),
         Resampled_pvalue_Bon = p.adjust(Resampled_Pvalue, method = "bonferroni"),
         Traditional_pvalue_Bon = p.adjust(Traditional_Pvalue, method = "bonferroni"))
diff_all_res2_adj %>% 
  filter(Resampled_Pvalue <= 0.05) %>%
  select(V1, V2, Improvement, Resampled_Pvalue, With_Int_Accuracy) %>%
  arrange(desc(Improvement))
#>          V1         V2 Improvement Resampled_Pvalue
#> 1 CryoSleep       Deck 0.005597287     2.651042e-02
#> 2 CryoSleep HomePlanet 0.002697578     2.588115e-11
#>   With_Int_Accuracy
#> 1         0.8022111
#> 2         0.7992304

Now that we’ve added all variables to our models, we see that only two interaction effects are statistically significant without any p-value adjustment (column Resampled_Pvalue). Kuhn and Johnson write:

> When the interactions that were discovered were included in a broader model that contains other (perhaps correlated) predictors, their importance to the model may be diminished. (…) This might reduce the number of predictors considered important (since the residual degrees of freedom are smaller) but the discovered interactions are likely to be more reliably important to a larger model.

diff_all_res2_adj %>% 
  filter(Resampled_pvalue_FDR <= 0.2) %>%
  select(V1, V2, Improvement, Resampled_pvalue_FDR, With_Int_Accuracy) %>%
  arrange(desc(Improvement))
#>          V1         V2 Improvement Resampled_pvalue_FDR
#> 1 CryoSleep HomePlanet 0.002697578         2.717521e-09
#>   With_Int_Accuracy
#> 1         0.7992304

diff_all_res2_adj %>% 
  filter(Resampled_pvalue_Bon <= 0.2) %>%
  select(V1, V2, Improvement, Resampled_pvalue_Bon, With_Int_Accuracy) %>%
  arrange(desc(Improvement))
#>          V1         V2 Improvement Resampled_pvalue_Bon
#> 1 CryoSleep HomePlanet 0.002697578         2.717521e-09
#>   With_Int_Accuracy
#> 1         0.7992304

If we adjust our p-values, only a single interaction effect remains: CryoSleep and HomePlanet. The stricter adjustment is the Bonferroni, which multiplies all the p-values by the number of tests (105 in our case), while a less strict adjustment is the FDR, which sorts the p-values in ascending order and multiplies each one by the total number of tests divided by the p-value’s rank (so that the lowest value gets multiplied by 105/1 in our case, the second lowest by 105/2, and so on). Our interaction holds even under the stricter Bonferroni adjustment, which implies that the effect is relatively strong.
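
A small worked example of both adjustments using base R’s p.adjust() with n = 105 tests (the count from our case; the three p-values themselves are made up):

```r
p <- c(0.0001, 0.004, 0.03)

# Bonferroni: each p-value times 105, capped at 1
p.adjust(p, method = "bonferroni", n = 105)
#> [1] 0.0105 0.4200 1.0000

# FDR (Benjamini-Hochberg): each p-value times 105 divided by its ascending rank
p.adjust(p, method = "fdr", n = 105)
#> [1] 0.0105 0.2100 1.0000
```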

Does this mean that we shouldn’t include the other interaction effects in our model? No. We can still carry the ones we’ve discovered so far forward to a later stage, where we will evaluate which features to select.

int_vars_very_imp <- diff_all_res2_adj %>% 
  filter(Resampled_pvalue_Bon <= 0.2) %>% 
  select(V1, V2) %>%
  mutate(ForFormula = str_c("starts_with('", V1, "'):starts_with('", V2, "')")) %>%
  mutate(RevFormula = str_c("starts_with('", V2, "'):starts_with('", V1, "')"))

int_vars_maybe_imp <- diff_all_res2_adj %>% 
  filter(Resampled_Pvalue <= 0.05) %>% 
  select(V1, V2) %>%
  mutate(ForFormula = str_c("starts_with('", V1, "'):starts_with('", V2, "')")) %>%
  mutate(RevFormula = str_c("starts_with('", V2, "'):starts_with('", V1, "')"))

6.4 All interaction effects

Let’s turn to a more comprehensive interaction analysis where we’ll use penalized regression to see whether any interactions prove beneficial to the model. Here we use the glmnet engine to first fit a model without interactions and then a model with interactions, tuning both over a grid of parameters.

The \(\lambda\) parameter controls the overall strength of the penalty applied to the regression coefficients, while \(\alpha\) controls the mix between the two penalty functions: \(\alpha = 1\) signifies a pure Lasso penalty, applied to the absolute value of each regression coefficient, while \(\alpha = 0\) signifies a pure Ridge penalty, applied to each squared coefficient. In our case, the glmnet model allows us to tune for the mix of these penalties that gives the best accuracy.
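
Concretely, glmnet minimizes a penalized negative log-likelihood of the form (in the notation of the glmnet documentation)

\[
\min_{\beta_0,\,\beta} \; -\frac{1}{n}\sum_{i=1}^{n} \log L\!\left(y_i,\; \beta_0 + x_i^{\top}\beta\right) \;+\; \lambda \left[ \frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 \;+\; \alpha\,\lVert\beta\rVert_1 \right],
\]

where setting \(\alpha = 1\) leaves only the \(\lVert\beta\rVert_1\) (Lasso) term and \(\alpha = 0\) only the squared (Ridge) term.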

df_pen <- train6 %>%
  select(-c(Cabin, Name, LastName, PassengerCount, HomePlanetsPerGroup)) %>%
    mutate(across(.cols = c(PassengerGroupSize, tidyselect::ends_with("PerGroup")), .fns = as.integer))

set.seed(8584)
my_pen_split <- initial_split(df_pen, prop = 0.8)
pen_train <- training(my_pen_split)
my_pen_folds <- vfold_cv(pen_train, v = 10, repeats = 5)

my_vars <- data.frame(Variables = names(pen_train)) %>%
  mutate(Roles = if_else(Variables == "PassengerId", "id", "predictor"),
         Roles = if_else(Variables == "Transported", "outcome", Roles))

pen_rec <- recipe(x = pen_train, vars = my_vars$Variables, roles = my_vars$Roles) %>%
  step_normalize(Age, RoomService, FoodCourt, ShoppingMall, Spa, VRDeck, TotalSpent, CabinNumber, LastNameAsNumber, PassengerGroup,
                 TotalSpentPerGroup) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors())

my_acc <- metric_set(accuracy)
pen_ctrl <- control_grid(verbose = TRUE, save_pred = TRUE, save_workflow = TRUE)
pen_grid <- expand.grid(mixture = c(0.2, 0.6, 1), penalty = c(1e-04, 2e-04, 3e-04, 4e-04, 5e-04))

glmnet_mod <- logistic_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet")

# my_cluster <- makeCluster(detectCores() - 1, type = 'SOCK')
# registerDoSNOW(my_cluster)
# 
# system.time({
#   set.seed(8584)
#   pen_tune <- glmnet_mod %>%
#     tune_grid(pen_rec, resamples = my_pen_folds, metrics = my_acc, control = pen_ctrl, grid = pen_grid)
# })
# 
# save(pen_tune, file = "Penalized regression without interactions.RData")
# 
# stopCluster(my_cluster)
# unregister()

load("Penalized regression without interactions.RData")
pen_best <- fit_best(pen_tune)
pen_coef <- pen_best %>%
  tidy() %>%
  filter(estimate != 0) %>%
  filter(term != "(Intercept)") %>%
  pull(term)

show_best(pen_tune, metric = "accuracy", n = 20) %>%
  mutate(mixture = as.factor(round(mixture, 2))) %>%
  ggplot(aes(x = penalty, y = mean, label = mixture, colour = mixture)) +
  geom_line() +
  geom_point() +
  labs(title = "Tune results without interactions", x = "Lambda penalty", y = "Resample accuracy", colour = "Alpha")
Tuning results for a Lasso/Ridge regression without interaction effects.

Figure 6.6: Tuning results for a Lasso/Ridge regression without interaction effects.

Without any interaction effects, the regression model favoured the pure Lasso (\(\alpha = 1\)) with the penalty \(\lambda = 0.0002\). The most important variables (those with the largest coefficients) were VRDeck, Spa, CryoSleep, HomePlanet and Deck. The final model used 32 out of 36 variables, and the resampled accuracy without any interactions was close to 0.79.

Now we repeat the process, but include interaction effects between the variables that were chosen in the previous section:

int_vars <- pen_coef %>%
  str_split_i(., "_", 1) %>%
  unique() %>%
  combn(., 2, simplify = FALSE)

# Map over pairs of vars to create int formula
int_formula <- map_chr(.x = int_vars, .f = \(vp) paste0("starts_with('", vp[1], "'):starts_with('", vp[2], "')")) %>%
  str_flatten(., collapse = "+") %>%
  paste("~", .) %>%
  as.formula(.)

pen_int_rec <- recipe(x = pen_train, vars = my_vars$Variables, roles = my_vars$Roles) %>%
  step_normalize(Age, RoomService, FoodCourt, ShoppingMall, Spa, VRDeck, TotalSpent, CabinNumber, LastNameAsNumber, PassengerGroup,
                 TotalSpentPerGroup) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_interact(int_formula) %>%
  step_zv(all_predictors())

pen_grid <- expand.grid(mixture = c(0.2, 0.6, 1), penalty = c(1e-04, 5e-04, 1e-03, 5e-03))
  
# my_cluster <- makeCluster(detectCores() - 1, type = 'SOCK')
# registerDoSNOW(my_cluster)
# clusterExport(cl = my_cluster, "int_formula")
# 
# set.seed(8584)
# pen_int_tune <- glmnet_mod %>%
#   tune_grid(pen_int_rec, resamples = my_pen_folds, metrics = my_acc, control = pen_ctrl, grid = pen_grid)
# 
# save(pen_int_tune, file = "Penalized regression with interactions.RData")
# 
# stopCluster(my_cluster)
# unregister()

load("Penalized regression with interactions.RData")

We see the results from the penalized regression model with interactions below.

show_best(pen_int_tune, metric = "accuracy", n = 40) %>%
  mutate(mixture = as.factor(round(mixture, 2))) %>%
  ggplot(aes(x = penalty, y = mean, label = mixture, colour = mixture)) +
  geom_line() +
  geom_point() +
  labs(title = "Tune results with interactions", x = "Lambda penalty", y = "Resample accuracy", colour = "Alpha")
Tuning results for a Lasso/Ridge regression with interaction effects.

Figure 6.7: Tuning results for a Lasso/Ridge regression with interaction effects.

The regression model now favours a mixed Lasso/Ridge (\(\alpha = 0.6\)) but with a higher penalty, \(\lambda = 0.005\). Out of 490 total variables (including interaction effects), it selected 170 for the best accuracy, which was somewhat above 0.80. Compared to the model without interaction effects (accuracy = 0.79), that’s a small improvement.

The top 20 variables from the penalized regression are shown below.

pen_int_best <- fit_best(pen_int_tune)
pen_int_coef <- pen_int_best %>%
  tidy()

pen_int_coef %>%
  select(-penalty) %>%
  filter(estimate != 0 & term != "(Intercept)") %>%
  arrange(desc(abs(estimate))) %>%
  slice(1:20) %>%
  ggplot(., aes(x = abs(estimate), y = reorder(term, abs(estimate)))) +
  geom_col() +
  geom_vline(xintercept = 0.55, colour = "red", linetype = "dashed") +
  labs(x = "Regression coefficient (abs)", y = "Variable") +
  theme(legend.position = "none")
Top variables from the Lasso/Ridge regression with interaction effects.

Figure 6.8: Top variables from the Lasso/Ridge regression with interaction effects.

It’s surprising to see that the strongest effect was between Deck & Side, which we inspected visually in Figure 6.3, where the interaction effect didn’t look very large. It also didn’t show up in our pairwise exploration, which perhaps indicates that this interaction is overfitted.

The other variables among the 20 most important were all interaction effects, where we see returning contenders like CryoSleep & HomePlanet, CryoSleep & Deck, HomePlanet & Deck and so on. Although the reduction is gradual, the last “big” drop is just above a coefficient of 0.5 (absolute value), indicated by the red line. Let’s add these interactions to a data frame we can use later.

pen_int_vars <- pen_int_coef %>%
  select(-penalty) %>%
  filter(abs(estimate) > 0.55 & term != "(Intercept)") %>%
  filter(str_detect(term, "_x_")) %>%
  mutate(V1 = str_split_i(term, "_", 1),
         V2 = str_split_i(term, "_", -2),
         ForFormula = str_c("starts_with('", V1, "'):starts_with('", V2, "')"),
         RevFormula = str_c("starts_with('", V2, "'):starts_with('", V1, "')")) %>%
  select(V1, V2, ForFormula, RevFormula)

6.5 Tree model interactions

Another method for indirectly discovering interaction effects is to use a tree-based model like a random forest. Kuhn and Johnson explain it better in their book, but in essence, tree models are by their nature interaction models: they look at thresholds of different variables to split the tree into branches, and consecutive splits often involve different variables, so the effect of one split depends on the previous one.
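
As a toy illustration (simulated data; rpart stands in here for the tree models above), a tree can recover an outcome that depends on two variables jointly without ever being given an explicit interaction term:

```r
library(rpart) # recommended package that ships with R

set.seed(1)
d <- data.frame(x1 = runif(400), x2 = runif(400))
# Outcome depends on x1 AND x2 jointly -- a threshold interaction
d$y <- factor(ifelse(d$x1 > 0.5 & d$x2 > 0.5, "yes", "no"))

fit <- rpart(y ~ x1 + x2, data = d, method = "class")
# The tree splits on one variable first, then on the other inside a branch,
# so the interaction is captured by the tree structure itself
mean(predict(fit, type = "class") == d$y) # training accuracy close to 1
```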

The results from a random forest model don’t provide interactions themselves but rank the original variables by their importance, which can then be used to model interaction effects in a second stage.

df_tree <- train6 %>%
  mutate(across(.cols = c(PassengerGroupSize, Solo, LargeGroup, TravelTogether, tidyselect::ends_with("PerGroup")),
                .fns = as.integer))

set.seed(8584)
my_tree_split <- initial_split(df_tree, prop = 0.8)
tree_train <- training(my_tree_split)

my_vars <- data.frame(Variables = names(tree_train)) %>%
  mutate(Roles = if_else(Variables %in% c("PassengerId", "Cabin", "Name", "LastName", "PassengerCount", "HomePlanetsPerGroup"),
                         "id", "predictor"),
         Roles = if_else(Variables == "Transported", "outcome", Roles))

tree_rec <- recipe(x = tree_train, vars = my_vars$Variables, roles = my_vars$Roles) %>%
  step_zv(all_predictors()) %>%
  step_normalize(Age, RoomService, FoodCourt, ShoppingMall, Spa, VRDeck, TotalSpent, CabinNumber, LastNameAsNumber, PassengerGroup,
                 TotalSpentPerGroup) %>%
  step_dummy(all_nominal_predictors())

tree_baked <- tree_rec %>%
  prep() %>%
  bake(new_data = NULL) %>%
  select(-c(Cabin, Name, LastName, PassengerCount, HomePlanetsPerGroup))

# rf_mod <- ranger::ranger(Transported ~ . -PassengerId, data = tree_baked, num.trees = 1000, importance = "impurity", 
#                  num.threads = detectCores() - 1, seed = 8584)
# 
# save(rf_mod, file = "Ranger tree model variable importance.RData")
load("Ranger tree model variable importance.RData")

rf_imp <- tibble(Predictor = names(rf_mod$variable.importance),
                 Importance = unname(rf_mod$variable.importance))
ggplot(rf_imp, aes(x = reorder(Predictor, Importance), y = Importance)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  xlab("")
Variable importance based on the ranger tree model

Figure 6.9: Variable importance based on the ranger tree model

The tree model heavily favours the amenity variables as well as a few other variables that our penalized model didn’t consider important enough. Based on the ranking in Figure 6.9, there is a clear cutoff after the LastNameCount feature.

If we return to our plotted interactions from the earlier figures, we can visually inspect some combinations before we select a final set to test. Some interactions that seem large are:

  • Age & RoomService

  • Age & Spa

  • Age & VRDeck

  • Age & CabinNumber

  • Age & PassengerGroup

  • RoomService & Spa

  • RoomService & VRDeck

  • FoodCourt & ShoppingMall

  • FoodCourt & Spa

  • FoodCourt & VRDeck

  • FoodCourt & LastNameAsNumber

  • FoodCourt & PassengerGroup

  • ShoppingMall & Spa

  • ShoppingMall & VRDeck

  • ShoppingMall & CabinNumber

  • ShoppingMall & LastNameAsNumber

  • ShoppingMall & PassengerGroup

  • CabinNumber & LastNameAsNumber

  • CabinNumber & PassengerGroup

Let’s add them to another set of possible interactions!

tree_int_vars <- data.frame(
  V1 = c(rep("Age", 5), rep("RoomService", 2), rep("FoodCourt", 5), rep("ShoppingMall", 5), rep("CabinNumber", 2)),
  V2 = c("RoomService", "Spa", "VRDeck", "CabinNumber", "PassengerGroup", "Spa", "VRDeck", "ShoppingMall", "Spa", "VRDeck",
         "LastNameAsNumber", "PassengerGroup", "Spa", "VRDeck", "CabinNumber", "LastNameAsNumber", "PassengerGroup",
         "LastNameAsNumber", "PassengerGroup")
  ) %>%
  mutate(ForFormula = str_c("starts_with('", V1, "'):starts_with('", V2, "')"),
         RevFormula = str_c("starts_with('", V2, "'):starts_with('", V1, "')"))